Overview

Dataset Statistics

Number of Variables 12
Number of Rows 10000
Missing Cells 0
Missing Cells (%) 0.0%
Duplicate Rows 0
Duplicate Rows (%) 0.0%
Total Size in Memory 937.6 KB
Average Row Size in Memory 96.0 B
Variable Types
  • Numerical: 12
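
The memory figures above can be roughly cross-checked from the row and variable counts. This is a minimal sketch that assumes every variable is stored as an 8-byte float (for example float64); the report itself does not state the storage type, and it ignores any index overhead.

```python
# Counts copied from the Dataset Statistics block.
n_rows = 10_000
n_variables = 12

# Assumption: each value is an 8-byte float; the report does not name the dtype.
bytes_per_value = 8

data_bytes = n_rows * n_variables * bytes_per_value
total_kb = data_bytes / 1024          # size of the raw values in KB
avg_row_bytes = data_bytes / n_rows   # bytes per row

print(total_kb, avg_row_bytes)
```

Under that assumption the raw values come to 937.5 KB and 96.0 B per row, which is close to the reported 937.6 KB; the small difference is plausibly index overhead.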

Dataset Insights

jaro_distance is skewed
jaro_winkler_distance is skewed
overlap_coefficient_distance is skewed
soft_tfidf_distance is skewed
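
The per-variable sections report a skewness coefficient γ1 for each column. As a hedged illustration of what that number measures, the sketch below computes the moment-based (population) skewness in plain Python. The helper name `skewness` and the sample lists are hypothetical, and the report does not say whether it applies a bias correction or which rule it uses to raise the "skewed" insight.

```python
def skewness(values):
    """Moment-based skewness: third central moment over the 1.5 power of the second."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5


# Hypothetical samples, not taken from the dataset.
symmetric = [1, 2, 3, 4, 5]
left_tailed = [0.1, 0.8, 0.9, 0.9, 1.0]  # one small value drags the tail to the left

print(skewness(symmetric))
print(skewness(left_tailed))
```

A symmetric sample gives a coefficient of zero, while a sample with a long lower tail gives a negative one, which is the sense in which the sections below say "skewed left".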

Variables


levenshtain_distance

numerical

Approximate Distinct Count 10000
Approximate Unique (%) 100.0%
Missing 0
Missing (%) 0.0%
Infinite 0
Infinite (%) 0.0%
Memory Size 160000
Mean 0.515
Minimum 1.1267e-05
Maximum 0.9352
Zeros 0
Zeros (%) 0.0%
Negatives 0
Negatives (%) 0.0%
  • levenshtain_distance is skewed left (γ1 = -1.0036)

Quantile Statistics

Minimum 1.1267e-05
5-th Percentile 0.0283
Q1 0.4271
Median 0.5612
Q3 0.6556
95-th Percentile 0.7769
Maximum 0.9352
Range 0.9352
IQR 0.2285

Descriptive Statistics

Mean 0.515
Standard Deviation 0.2058
Variance 0.04236
Sum 5150.1337
Skewness -1.0036
Kurtosis 0.4994
Coefficient of Variation 0.3996
  • levenshtain_distance is not normally distributed (p-value 0.00151310276285575)
  • levenshtain_distance has 826 outliers
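
Several of the statistics in each variable section are derived from the others. The sketch below recomputes a few of them for levenshtain_distance from the rounded values printed above. The outlier fences use Tukey's 1.5 × IQR rule, which is an assumption: the report does not say which rule produced its outlier count.

```python
# Values copied from the levenshtain_distance section of the report.
n_rows = 10_000
mean, std = 0.515, 0.2058
q1, q3 = 0.4271, 0.6556

iqr = q3 - q1          # interquartile range
cv = std / mean        # coefficient of variation
variance = std ** 2    # can differ from the report in the last digit because std is rounded
total = mean * n_rows  # approximate sum, limited by the rounding of the mean

# Assumption: Tukey's 1.5 * IQR fences; the report's actual outlier rule is not stated.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

print(round(iqr, 4), round(cv, 4), round(variance, 5), round(total, 1))
print(round(lower_fence, 5), round(upper_fence, 5))
```

The recomputed IQR (0.2285) and coefficient of variation (0.3996) match the reported values. Under the assumed fences, the upper fence lies above the reported maximum of 0.9352, so any flagged values would sit in the lower tail, which is consistent with the left skew.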

needleman_wunsch_distance

numerical

Approximate Distinct Count 10000
Approximate Unique (%) 100.0%
Missing 0
Missing (%) 0.0%
Infinite 0
Infinite (%) 0.0%
Memory Size 160000
Mean 0.6758
Minimum 8.3416e-05
Maximum 1.3416
Zeros 0
Zeros (%) 0.0%
Negatives 0
Negatives (%) 0.0%
  • needleman_wunsch_distance is skewed left (γ1 = -0.6583)

Quantile Statistics

Minimum 8.3416e-05
5-th Percentile 0.03945
Q1 0.5421
Median 0.7104
Q3 0.8596
95-th Percentile 1.0931
Maximum 1.3416
Range 1.3415
IQR 0.3175

Descriptive Statistics

Mean 0.6758
Standard Deviation 0.2844
Variance 0.08086
Sum 6758.0349
Skewness -0.6583
Kurtosis 0.2005
Coefficient of Variation 0.4208
  • needleman_wunsch_distance is not normally distributed (p-value 0.0019417758588998874)
  • needleman_wunsch_distance has 818 outliers

affine_gap_distance

numerical

Approximate Distinct Count 10000
Approximate Unique (%) 100.0%
Missing 0
Missing (%) 0.0%
Infinite 0
Infinite (%) 0.0%
Memory Size 160000
Mean 0.5708
Minimum 5.825e-05
Maximum 1.0737
Zeros 0
Zeros (%) 0.0%
Negatives 0
Negatives (%) 0.0%
  • affine_gap_distance is skewed left (γ1 = -0.8365)

Quantile Statistics

Minimum 5.825e-05
5-th Percentile 0.03338
Q1 0.4665
Median 0.6103
Q3 0.7284
95-th Percentile 0.8903
Maximum 1.0737
Range 1.0736
IQR 0.2618

Descriptive Statistics

Mean 0.5708
Standard Deviation 0.2329
Variance 0.05424
Sum 5707.7567
Skewness -0.8365
Kurtosis 0.3426
Coefficient of Variation 0.408
  • affine_gap_distance is not normally distributed (p-value 0.0022307498411491167)
  • affine_gap_distance has 813 outliers

smith_waterman_distance

numerical

Approximate Distinct Count 10000
Approximate Unique (%) 100.0%
Missing 0
Missing (%) 0.0%
Infinite 0
Infinite (%) 0.0%
Memory Size 160000
Mean 0.5821
Minimum 1.3222e-05
Maximum 0.9125
Zeros 0
Zeros (%) 0.0%
Negatives 0
Negatives (%) 0.0%
  • smith_waterman_distance is skewed left (γ1 = -1.3605)

Quantile Statistics

Minimum 1.3222e-05
5-th Percentile 0.02928
Q1 0.5107
Median 0.6473
Q3 0.7264
95-th Percentile 0.8176
Maximum 0.9125
Range 0.9124
IQR 0.2157

Descriptive Statistics

Mean 0.5821
Standard Deviation 0.2185
Variance 0.04774
Sum 5821.407
Skewness -1.3605
Kurtosis 1.1454
Coefficient of Variation 0.3753
  • smith_waterman_distance has 918 outliers

jaro_distance

numerical

Approximate Distinct Count 10000
Approximate Unique (%) 100.0%
Missing 0
Missing (%) 0.0%
Infinite 0
Infinite (%) 0.0%
Memory Size 160000
Mean 0.9847
Minimum 0.9092
Maximum 0.9955
Zeros 0
Zeros (%) 0.0%
Negatives 0
Negatives (%) 0.0%
  • jaro_distance is skewed left (γ1 = -3.4211)

Quantile Statistics

Minimum 0.9092
5-th Percentile 0.9662
Q1 0.9835
Median 0.9878
Q3 0.9904
95-th Percentile 0.9942
Maximum 0.9955
Range 0.08632
IQR 0.006893

Descriptive Statistics

Mean 0.9847
Standard Deviation 0.01164
Variance 0.00013556
Sum 9846.8098
Skewness -3.4211
Kurtosis 14.2585
Coefficient of Variation 0.01182
  • jaro_distance is not normally distributed (p-value 2.1357797119225363e-10)
  • jaro_distance has 856 outliers

jaro_winkler_distance

numerical

Approximate Distinct Count 10000
Approximate Unique (%) 100.0%
Missing 0
Missing (%) 0.0%
Infinite 0
Infinite (%) 0.0%
Memory Size 160000
Mean 0.1189
Minimum 3.6341e-06
Maximum 0.5153
Zeros 0
Zeros (%) 0.0%
Negatives 0
Negatives (%) 0.0%
  • jaro_winkler_distance is skewed right (γ1 = 0.8848)

Quantile Statistics

Minimum 3.6341e-06
5-th Percentile 0.001826
Q1 0.009472
Median 0.01945
Q3 0.2885
95-th Percentile 0.4014
Maximum 0.5153
Range 0.5153
IQR 0.279

Descriptive Statistics

Mean 0.1189
Standard Deviation 0.1554
Variance 0.02414
Sum 1188.6747
Skewness 0.8848
Kurtosis -1.0022
Coefficient of Variation 1.307
  • jaro_winkler_distance is not normally distributed (p-value 2.51673079389157e-16)

overlap_coefficient_distance

numerical

Approximate Distinct Count 10000
Approximate Unique (%) 100.0%
Missing 0
Missing (%) 0.0%
Infinite 0
Infinite (%) 0.0%
Memory Size 160000
Mean 0.2563
Minimum 5.6339e-06
Maximum 0.8962
Zeros 0
Zeros (%) 0.0%
Negatives 0
Negatives (%) 0.0%
  • overlap_coefficient_distance is skewed right (γ1 = 0.338)

Quantile Statistics

Minimum 5.6339e-06
5-th Percentile 0.007622
Q1 0.03787
Median 0.2204
Q3 0.4155
95-th Percentile 0.5861
Maximum 0.8962
Range 0.8962
IQR 0.3776

Descriptive Statistics

Mean 0.2563
Standard Deviation 0.2
Variance 0.03999
Sum 2562.7854
Skewness 0.338
Kurtosis -1.0352
Coefficient of Variation 0.7803
  • overlap_coefficient_distance is not normally distributed (p-value 2.7924960180227267e-10)

generalized_jaccard_distance

numerical

Approximate Distinct Count 10000
Approximate Unique (%) 100.0%
Missing 0
Missing (%) 0.0%
Infinite 0
Infinite (%) 0.0%
Memory Size 160000
Mean 0.5545
Minimum 0.00015525
Maximum 0.9519
Zeros 0
Zeros (%) 0.0%
Negatives 0
Negatives (%) 0.0%
  • generalized_jaccard_distance is skewed left (γ1 = -1.1398)

Quantile Statistics

Minimum 0.00015525
5-th Percentile 0.02759
Q1 0.4741
Median 0.6107
Q3 0.7116
95-th Percentile 0.7992
Maximum 0.9519
Range 0.9517
IQR 0.2375

Descriptive Statistics

Mean 0.5545
Standard Deviation 0.2209
Variance 0.04879
Sum 5545.2462
Skewness -1.1398
Kurtosis 0.4968
Coefficient of Variation 0.3983
  • generalized_jaccard_distance has 852 outliers

tfidf_distance

numerical

Approximate Distinct Count 10000
Approximate Unique (%) 100.0%
Missing 0
Missing (%) 0.0%
Infinite 0
Infinite (%) 0.0%
Memory Size 160000
Mean 0.6452
Minimum 8.4367e-05
Maximum 0.974
Zeros 0
Zeros (%) 0.0%
Negatives 0
Negatives (%) 0.0%
  • tfidf_distance is skewed left (γ1 = -1.4896)

Quantile Statistics

Minimum 8.4367e-05
5-th Percentile 0.03002
Q1 0.578
Median 0.7181
Q3 0.8054
95-th Percentile 0.8783
Maximum 0.974
Range 0.9739
IQR 0.2273

Descriptive Statistics

Mean 0.6452
Standard Deviation 0.2357
Variance 0.05555
Sum 6452.4798
Skewness -1.4896
Kurtosis 1.4795
Coefficient of Variation 0.3653
  • tfidf_distance has 889 outliers

soft_tfidf_distance

numerical

Approximate Distinct Count 10000
Approximate Unique (%) 100.0%
Missing 0
Missing (%) 0.0%
Infinite 0
Infinite (%) 0.0%
Memory Size 160000
Mean 0.9874
Minimum 0.9092
Maximum 0.9983
Zeros 0
Zeros (%) 0.0%
Negatives 0
Negatives (%) 0.0%
  • soft_tfidf_distance is skewed left (γ1 = -3.488)

Quantile Statistics

Minimum 0.9092
5-th Percentile 0.9672
Q1 0.9866
Median 0.9907
Q3 0.9932
95-th Percentile 0.997
Maximum 0.9983
Range 0.08914
IQR 0.006642

Descriptive Statistics

Mean 0.9874
Standard Deviation 0.0122
Variance 0.00014877
Sum 9873.9898
Skewness -3.488
Kurtosis 14.6305
Coefficient of Variation 0.01235
  • soft_tfidf_distance is not normally distributed (p-value 3.0764986049316374e-11)
  • soft_tfidf_distance has 927 outliers

partial_ration_distance

numerical

Approximate Distinct Count 10000
Approximate Unique (%) 100.0%
Missing 0
Missing (%) 0.0%
Infinite 0
Infinite (%) 0.0%
Memory Size 160000
Mean 0.2922
Minimum 8.6548e-06
Maximum 0.7388
Zeros 0
Zeros (%) 0.0%
Negatives 0
Negatives (%) 0.0%
  • partial_ration_distance is skewed left (γ1 = -0.2641)

Quantile Statistics

Minimum 8.6548e-06
5-th Percentile 0.01251
Q1 0.1462
Median 0.3264
Q3 0.4187
95-th Percentile 0.5362
Maximum 0.7388
Range 0.7388
IQR 0.2725

Descriptive Statistics

Mean 0.2922
Standard Deviation 0.1702
Variance 0.02896
Sum 2921.9579
Skewness -0.2641
Kurtosis -0.9402
Coefficient of Variation 0.5824

bag_distance_distance

numerical

Approximate Distinct Count 10000
Approximate Unique (%) 100.0%
Missing 0
Missing (%) 0.0%
Infinite 0
Infinite (%) 0.0%
Memory Size 160000
Mean 0.4005
Minimum 6.8559e-05
Maximum 0.8916
Zeros 0
Zeros (%) 0.0%
Negatives 0
Negatives (%) 0.0%
  • bag_distance_distance is skewed left (γ1 = -0.1269)

Quantile Statistics

Minimum 6.8559e-05
5-th Percentile 0.0285
Q1 0.2778
Median 0.4042
Q3 0.534
95-th Percentile 0.7184
Maximum 0.8916
Range 0.8915
IQR 0.2561

Descriptive Statistics

Mean 0.4005
Standard Deviation 0.1933
Variance 0.03738
Sum 4005.0578
Skewness -0.1269
Kurtosis -0.402
Coefficient of Variation 0.4827
  • bag_distance_distance is not normally distributed (p-value 7.956223941072542e-08)
